Bioinformatics An Introduction 4th Edition (Jeremy Ramsden)

Statistics and Causation

No examination of numbers is complete without careful consideration of how
the measurements were obtained, including the observational or experimental
setup. The shape of a distribution of measurements is often crucial for
distinguishing between different models of how the numbers could have arisen.
Sometimes the extremes of the distribution are especially important in making
that distinction; since the numbers there are sparse, confidence in the
reliability of their values matters all the more.

Statistics often focuses on establishing correlations without enquiring into causes.
Causation is discussed in the next section.

10.2 The Calculus of Causation

Although in Chap. 6 the goal of science was rather dispassionately stated as
“generating conditional information in the form of hypotheses and theories relating
the observed facts to each other using axiom systems” (Sect. 6.1.2), this does not
really capture the enormously strong desire of man to understand the causes of
things. As Max Planck has remarked,² “As the law of causality immediately seizes
the awakening soul of the child and causes him indefatigably to ask ‘Why?’ so it
accompanies the investigator through his whole life and incessantly sets him new
problems”. Statistics originated in a search for causation, but ended up becoming
a tool to establish correlations between variables, essentially as a data-reduction
exercise. This view is epitomized by Karl Pearson’s remark that “data is all there
is to science”, and echoed by R. A. Fisher, who saw statistics as the study of
methods of data reduction. As such, one might even question whether it could
generate new knowledge, since once the structural framework of the procedures and
calculations was established, the rest would be merely a matter of deduction.

Planck’s apothegm echoes Virgil’s felix qui potuit rerum cognoscere causas. An
important step on the road to getting to grips with causation as something beyond
association and correlation was Sewall Wright’s path analysis.³

Statistics is rooted in observation, for which probabilistic notation is well suited.
The probability of an event can be established by observing its frequency of
occurrence. Events can be linked via conditional probability (Sect. 9.2.2). Thus, in
agronomy, one might ask the question “what is the probability of an $x$-fold enhanced
yield ($Y$), given that it rained for the entire month of June?” This can be expressed
as $P\{Y \mid R\}$. Observation might lead to the establishment of a correlation between
crop yield and rainfall $R$. A similar question, “what is the probability of an $x$-fold
enhanced yield, given that the field has been fertilized with gypsum?” might be
addressed in a similar fashion, leading to the establishment of a correlation between
crop yield and fertilizer dose. But clearly fertilization is a human intervention. It was

2 Planck (1932).

3 Wright (1921, 1983), see also Burks (1926), Good (1961), Pearl (1994, 2020). The famous guinea pig experiments are described in Wright (1920).